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DIRECTIONAL STATISTICS ON PERMUTATIONS 
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Abstract. Distributions over permutations arise in applications ranging from multi-object tracking to 

ranking of instances. The difficulty of dealing with these distributions is caused by the size of their domain, 

which is factorial in the number of considered entities (n!). It makes the direct definition of a multinomial 

f^ ^ distribution over permutation space impractical for all but a very small n. In this work we propose an 

^S| ' embedding of all n! permutations for a given n in a surface of a hypersphere defined in R("~^' . As 

a result of the embedding, we acquire ability to define continuous distributions over a hypersphere with 

^ ' all the benefits of directional statistics. We provide polynomial time projections between the continuous 

hypersphere representation and the n!-element permutation space. The framework provides a way to use 

continuous directional probability densities and the methods developed thereof for establishing densities over 

permutations. As a demonstration of the benefits of the framework we derive an inference procedure for a 

state-space model over permutations. We demonstrate the approach with applications. 



1-^ ' 1. Introduction 



Since the inception of the field of computer science, there has been a strong dichotomy between opti- 
mization in continuous spaces (such as IR'*) and combinatorial spaces (such as the space of permutations on 
d objects). While there are computationally hard problems in both kinds of spaces, combinatorial spaces 
^i ' are far more often the villain. It seems as if nearly all interesting learning, optimization, and representa- 

tion problems in combinatorial spaces are NP-complete in the best case. Bayesian inference in the space of 
permutations, for example, is an important, yet frustratingly difficult problem [5|. 
^ I We feel that a key factor at the heart of this dichotomy is that combinatorial spaces are far more unstruc- 

|Tr ' tured than the familiar continuous spaces. A priori, combinatorial spaces are simply sets of objects, with no 

^^ , relationship among them. Compare this to, say, Euclidean d-space, which comes equipped with a topology, 

(SJ ' continuity, completeness, compact subsets, a metric, an inner product, and so oiillO|. On these properties 

are built the entire infrastructure of analysis, including tools like the derivative [6|. In turn, the derivative 
f— ^ ' is at the heart of most optimization techniques and representations such as the Fourier basis. Essentially, 

(^ \ the last four centuries of mathematics has been developing tools for representation and optimization in con- 

tinuous spaces. Combinatorial spaces, on the other hand, have been burdened with fewer assumptions, but 
endowed with fewer advantages. 
. ; ■ One strategy for working with combinatorial spaces is to embed them into continuous spaces and work 

r> \ there with powerful analytic tools. This trick has proven to be powerful in, for example, continuous relax- 

j^ ' ations of integer programming problems [3|. It has enjoyed relatively less penetration in machine learning, 

however. And where versions of it have appeared [2,l9|, the connection to the topology and analytic properties 
of the embedding space is typically not made explicit, nor fully exploited. 

In this paper, we demonstrate the power of the embedding approach by developing a fast, accurate 
approach to Bayesian inference over permutations. Arising in tasks such as object tracking [5| or ranking [9[, 
this problem is challenging because of the factorially-large number of parameters in an exact representation 
of a general probability distribution in this space. Prior approaches have worked by approximating a general 
probability distribution with a restricted set of basis functions [J] , or by embedding the permutation space 
only implicitly, and working with a heuristically chosen probability distribution [9| . 

The paper follows the hierarchical structure of our main contributions, where each level of the hierarchy 
is split into theoretical observations and developments that make these observations practical: 

• Theoretical observations: we demonstrate an embedding of the n! permutation set onto the 
surface of a hypersphere S"^ centered at the origin in R''+^ with d = {n ~ 1)^ — 1. 



supported by NIH under grant number NCRR 1P20 RR021938. 

1 



— Observations: we propose a hypersphere embedding of permutations. 

— Practical results: we develop polynomial time transformations between the discrete n! permu- 
tation space and its continuous hypersphere representation. 

• Results that allow practical use of the theory: we demonstrate a bridge between directional 
statistics [8| and permutation sets that leads to cfhcicnt inference. 

— Observations: wc propose the von Mises-Fisher density over permutations. 

— Practical results: we develop efficient inference over permutations in a state-space model. 

* We employ analytical product and marginalization operations. 

* We show efficient transformation of partially observed permutations onto the surface of 
the hypersphere $**. 

2. Embedding permutations onto a hypersphere surface 

Among many representations of permutations in this work we are interested in the n x n permutation 
matrix representation P. Note that nowhere in the paper we are going to use this as the permutation 
operator, which is the usual intension of the matrix representation of permutations. The permutation 
matrix representation is a square bistochastic matrix with entries Py £ {0, 1}, serves more as an easy to 
interpret guide and a way to establish some required properties than an expression for a linear operator, 
whereas we interpret it in the rest of the paper merely as a vector in R" . To avoid notation clutter we 
treat all the matrices further in the paper as vectors in R" omitting the special vector stacking operation 
symbols (such as wec(-)), unless specified otherwise. 

2.1. Representation. In this section we will show how a permutation set with n! elements can be embedded 
onto the surface of a (n — 1)^ dimensional hypersphere. 

Our representation takes advantage of the geometry of the Birkhoff polytope and in part relies on the 



Birkhoff-von Neumann theorem [11|, which we state here without proof. 



Theorem 1. All n x n permutation matrices in R" are extreme points of a convex {n — 1)^ dimensional 
polytope, which is the convex hull of all bistochastic matrices. 

Next, we formulate a lemma that the rest of the section is based on: 



Lemma 1. Extreme points of the Birkhoff polytope are located on the surface of a radius \/n — 1 hypersphere 
clustered around the center of mass of all n! permutations. 

Proof. To show that the statement is valid we first compute the center of mass and then show that each 
permutation is located at an equal distance from this center. The center of mass for all the permutations on 
n objects is defined in R" as cm = ^ X]fe=i Pfe- 

We observe that the number of permutation matrices for which Pii = lis(n — 1)!, which follows from 
the effective removal of the first row and column of an n x n matrix caused by the assignment. Thus, 
Y^ Pii ~ {n ^ 1)1 which, following the same reasoning, is true for any P^ and leads to 

(1) CM ^^{n- 1)11^-1 

n\ n 

To see that all permutations are equidistant from the center of mass, we observe that ||1 — P||2 ~ \/n? ~ n 
for any P. With this observation we can compute the radius of the sphere: 



(2) 



n 



1 ,1 



(n^ — n)^r + n( 1)^ = \/n — 1 



D 



To show that the hypersphere of Leinma[T]is embedded into a space of lower dimension than R" we observe 
the following. With respect to the original formulation of permutations in R" , all of the permutations are 
located on the intersection of a hypersphere centered at the origin with ^/n radius and a hypersphere of 
Lemma [1] This intersection is still a hypersphere only with dimension lowered by one. The following lemma 
allows us to get the dimension of this hypersphere down to the one of Theorem [T] 
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Lemma 2. All permutations P are located on the intersection of 2n — 1 hyperplanes, i.e., in in — 1)^- 
dimensional affine subspace of R" . 

Proof. Let us denote by W^ j^ a.n n x n matrix with all elements except a single i*'* row of ones set to zero 
and likewise Wj^ ^ for columns. Observe that: 

(3) vec(w^-^) t;ec(P) = l ^ec (Wi^^") fec(P) = l 

for any permutation matrijo- It follows, that all permutations are located at an intersection of 2n hyperplanes 
defined by their normals: W-|^ ^ and W^ -|^, with n € {1 ■ • .ri}, and having bias of 1. This set is, however, 
not independent, because any W^ ;|^ can be expressed by a linear combination of the other 2n — 1 vectors 
by setting weights of W- ,j ;|^ to —1 and weights of W;|^ ^ to 1 for i,j G {1 . . .n}. This leads to 2n — 1 
hyperplanes whose intersection forms the space in which the hypersphere containing the Birkhoff-polytope 
is located. Thus, the dimension of the space containing the polytope is n^ — 2n + 1 = (n — 1)^. D 

All permutation matrices on n objects belong to the surface of a radius ^/n — 1 hypersphere, S'^, in 
]]^(n-i) j^g established by Lemmas [1] and [2] We do not rigorously show here, but assume that by inherent 
symmetry in the structure of permutation matrices they arc distributed evenly across the surface of S'^. 

2.2. Transformations. The representation of the previous section allows us to define and manipulate prob- 
ability density functions on $ using approaches of continuous mathematics and only then transforming 
quantities of interest back to the discrete n! permutation space. This is useful when there is a way to effi- 
ciently transform elements of one space to the other. Next wc show how this can be achieved in polynomial 
time. 

The key components posing difficulties arc discrete vs. continuous space, and the requirement of S"^ to be 
origin-centered (required for Section |3]). The former poses a considerably more challenging problem than the 
latter and absence of both would reduce the required transformations to a simple change of basis between 
R" and R*^"^^' . We develop the transformations in the proof to the following lemma. 

Lemma 3. There exist polynomial time transformations between the discrete n\ permutation space and the 
surface of the origin- centered (n — 1)^ dimensional hypersphere of radius \/n — 1. 

Proof. The transformation from a permutation space to $** requires only a short sequence of linear 
operations as it is made clear by lemmas of Section [2. II 

(1) Shift the permutation matrix P by ^1 to put the center of mass at the origin. 

(2) Change the basis by projecting into the R'""^) subspace orthogonal to W^ ■ and W^ ^. 

Since there are {n— 1)^ basis vectors of length n^, the projection operation takes 0{n'^). Note that the basis 
can be obtained by the QR factorization, which is 0{n^) in this case, but needs to be computed only once 
for a given n. 

Transforming an arbitrary point from S^ to the permutation space is more challenging. Now we 
have to linearly transform the point from S'^ to R" and then among n! possibilities find a permutation, 
that is the closest, in L2 sense, to a given point. The transformation is easily done by inverting the order 
of operations for going from R" to S"*, which amounts to 0{n'^) operations. Let us show how to efficiently 
find a permutation matrix closest to a transformed point. 

Given an arbitrary point T^ in R" , which corresponds to a point on S'^, as indicated by the superscript, 
we introduce a matrix D where 

(4) D,, = (T,^. - 1)^ 

Finding the permutation P* closest to T* amounts to finding P* that minimizes J^ij'^ij'^ij- This is the 
same as matching every column and each row to a single counterpart so that the sum of matching weights 
(elements of D) is minimal. In this case, D is an n x n edge-weight matrix for a 2n node bipartite graph 
with n elements per partition. This is the familiar minimum weighted bipartite matching problem [l3|. 
This observation allows us to apply a minimum weighted bipartite matching algorithm [13| and obtain a 



In fact, for any bistochastic matrix, as implied by Theorem [T] 
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permutation P* closest to T*. The running time of the fastest general algorithms for solving this problem 
is 0{n^ logn + n^e), where e is the number of edges in the bipartite graph. Since the number of edges in 
our case is always n, the running time effectively becomes 0{n^). However it is dominated by the time of 
projecting a point from S'^ to R" , which is O(n^) as shown before. D 

Coupling the probability representations to the transformation operations bridges the gap between the 
discrete, combinatorial space of permutations and the continuous, low-dimensional hyperspherc. This allows 
us to lift the large body of results developed for directional statistics [8| directly to permutation inference. 

3. Directional statistics 

A number of probability density functions on S'^ have been developed in the field of directional statistics [8[ . 
A detailed account is given for an interested reader in [8|, Chapter 9]. The directional statistics framework 
allows us to define quite general classes of density functions over permutations. In the rest of the paper, we 
use one of the basic models to demonstrate the usefulness of our representation and the model as well. 

3.1. von Mises-Fisher distribution. This is a m-variate von Mises-Fisheio (vMF) distribution of a m- 
dimensional vector x, where \\fj,\\ = 1, k > and m > 2: 

(5) /(a;|/x,K) = Z„(K)e'=/^^^ with normalization term Z„, (k) = —, 

where Ir[') is the r*'' order modified Besscl function of the first kind and k is called the concentration 
parameter. Examples of samples from the distribution on S^ are shown in Figure [T] 

K = 0.37131 K = 7 Aim K = 36.3869 k = 78.2916 k = 95.9959 




Figure 1. Samples of the von Mises-Fisher density function on S^ for random /x and k. 

In terms of a pdf on permutations the vMF establishes a distance-based model, where distances are 
geodesic on S''. The advantage of the formulation in a continuous space is the ability to apply a range of 
operations on the pdf and still end up with the result on S''. This advantage is realized in the inference 
procedures which we establish next. 

3.2. Efficient inference in a state space model. The results presented above establish a framework in 
which it is possible to define and manage in reasonable time probability densities over permutations. An 
important application of this framework is in the probabilistic data association (PDA) |12| . In PDA we are 
interested in maintaining links between objects and tracks under the noisy tracking conditions. Ignoring the 
underlying position estimation problem we focus on the part related to the identity management, as in [5[, 
which boils down to tracking a hidden permutation (identity assignment) under a noisy observed assignment. 
In order to perform identity tracking of permutations in the context of recursive Bayesian filtering (which 
we are going to do) wc need to define the following components: 

(1) A transition model, P(Xf|Xt_i); 

(2) An observation model, P(Yf|X() where Yt is the noisy observation of the hidden permutation 
matrix Xt; 



Sometimes also called the Langevin distribution. 



(3) A way to perform the following operations: 

(6) multiplication: F(Xt|YO oc P(Yt|XOP(Xt|Yt_i) 

(7) marginalization: P(Xt|Yt_i) = / P(Xt|Xt_i)P(Xt_i|Yt_i)dXt_i 

Avoiding transformation overhead we restrict all of the above to S'^. Hence, X and Y arc S'^ represen- 
tations of their respective hidden and observed permutations. We define both transition and observation 
models as vMF functions centered at the true permutation. Due to similarity of the vMF model to the 
multivariate Gaussian density, it seems natural to view this recursive filter as an analogy of the Kalman 
filter. In this view, the result of this sections is porting a widely successful tracking model to the discrete n! 
permutation space. 

To further stress the analogy with the Kalman filter, we show that projection operation can be com- 
puted analytically in a closed form and marginalization operation can be efficiently approximated with good 
accuracy [l|, l8|. For observation model P(Yf|Xt) oc vMF{Yt,Kobs) and posterior model P(X(|Y(_i) ex 
vM F{fj, , Kpos) the multiplication operation results in a vMF for P(Xf|Yt) parametrized as 

(8) /^t = - (^ofesYt + KposfJ'pos) '** ^ ll^obsYt + KposfJ-posW- 

In the case of a vMF transition model, the marginalization can be performed with a reasonable accuracy 
and speed using the fact that a vMF can be approximated by an angular Gaussian and performing ana- 
lytical convolution of angular Gaussian with subsequent projection back to vMF space [8| . Resulting vMF 
P(Xt|Yt_i) is parametrized as: 

(9) ^i. = Xt^i + Hp^, K = A-^{Ad{Kpos)Ad{Ktr)) Adin) = ^ '^^^ '] . 

The ratio of modified Bessel functions required for this approach can be efficiently computed with high 
accuracy by using Lentz method based on evaluating continued fractions [7| . 



3.2.1. Partial observations. Analytical computation of the 
Bayesian recursive filtering presented above relies on the fact 
that permutations are observed completely. In tracking problems ■"" 
that would mean the algorithm has to receive observations (up to ^ 
noise) of identities of every tracked object. This is a rare setting 4 

5 

6 
7 



12 3 4 5 6 7 



and most commonly observations are available only partially. 

When a partial observation of o objects becomes available, 
the dimension of the unknown part of Y is reduced from n^ to 
(n—o)'^ . The mechanism of this is shown in Figure[21 where circles 

indicate two observed objects and squares indicate the unknown Figure 2. An example of a fallback to 
parts of P. The unknown part of the representation of P on S"^ a lower dimensional permutation space 
needs to be marginalized out to obtain the likelihood used in dH). when a partial observation becomes avail- 
Figure[2]shows that this marginalization is straightforward in R" able, 
space. Unfortunately, to implement ([7]), we need to marginalize 
on the surface of the sphere, S'' C R*-""^^ - a much more difficult 
task. 

Denoting the orthogonal part of the basis in R" that represents the R^"^^'" subspace by an n^ x (n— 1)^ 
matrix Q, we project into this subspace by: 

(10) vec (Y) = Cfvec ( P 1 

In the case of a partial observation, we know which elements of the vector being projected are consistent 
with the observation and are not going to change and which elements can have any possible value. This 
allows us to split the resulting vector Y into 

(11) Y = Y, +Y?, 

where Y* and Y? respectively denote the observed and unobserved parts. 
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The likelihood with the unknown observations marginalized out becomes: 



(12) 



^KiY^cc+KiYfa^^Y-, 



1 

— f 
Z 



,K,i^i X 



3«lY'^dYv 



/Y? ^ JY? 

Some details make computing the integral in (|12p not totally trivial: a;, Y*, and Y? are of different length; 
although X is fixed, Y* and Y? are not allowed to take any possible angle in IR^"^^^ . We omit the details 
of the derivation dealing with these difficulties and just state the parameters of the resulting vMF likelihood 
function: 

(13) '^=FCfV k=|1kiY,||2 

II I*||2 

Thus, in the case of vMF we can execute a recursive Bayesian filter using only analytical computation 
even in the cases when only partially observed data is available. This makes the state space model applicable 
in a much wider range of scenarios than our initial model presented in Section ??. 

4. Experiments 



25! = 1.5511e+25 



50! = 3.0414e+64 




0.2 0.4 0.6 O.i 

average observation error 




0.2 0.4 0.6 O.S 

average observation error 



(a) 25 objects 



(b) 50 objects 



Figure 3. Average error of a random hidden permutation inference from 100 (partial) noisy 
observations on 25 and 50 objects simulated datasets. Runs were repeated 10 times with a 
different permutation. 



To demonstrate correctness of our approach, we show inference of a fixed hidden permutation from its 
noisy partial observations. Figure [3] shows results of this inference on dataset of 25 and 50 objects. In 
these first, synthetic data, experiments, we first randomly chose a true (hidden) permutation, Ptrue- We 
controlled both observation noise {v £ 0.1, 0.2, ..., 0.9) and fraction of objects missing from observations (m € 
0%, 20%, 40%, 60%). Noisy observations were drawn from vMF{Ptrue,K^), where Ki, was chosen to achieve 
v fraction of incorrectly observed object identities. The final observation, Pm, was generated by hiding 
m percent of entries from the noisy observation matrix, chosen uniformly at random without replacement. 
Figure [3] shows that our representation of the n\ discrete permutation space is functional and the approach 
can gracefully handle large number of objects, partial observations and observation noise. 

The above simulation was generated with the noise model used by the inference and did not have a 
temporal component, although it was applied to a really large state space. Next we show experiments on a 
tracking dataset with a non-vMF transition model. We use a dataset of planar locations of aircraft within 
a 30 mile diameter of John F. Kennedy airport of New York. The data, in streaming format, is available at 
[http : //www4 ■ passur . com/ j f k . htm l The complexity of the plane routes and frequent crossings of tracks 
in the planar projection make this an interesting dataset for identity tracking. Identity tracking results on 
this dataset, in the context of the symmetric semigroup approach to permutation inference, were previously 
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(a) tracking identities of 6 flights 
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(b) tracking identities of 10 flights 



Figure 4. Tracking error on the air traffic control dataset for 6 and 10 planes as a function 
of observation noise shown as the fraction of incorrectly reported planes. Separate plots 
show error for partial observations when a fraction of object identities is unobserved. 



reported in [5|. Replicating the task reported in [51, we show results on tracking datasets of 6 and 10 flights, 
dropping the 15 flights dataset (but see below). 

The dataset conies prelabeled, but the uncertainty is introduced by randomly swapping identities of flights 
i and j at their respective locations x^ and Xj with probability Pswape^Pi^W^jit) ^ ^j(0lP/(25^))- where 



-'swap 



0.1 and s = 0.1 are strength and scale parameters respectively. 



We then generated observation and hidden identity noise in the same way as for the prior experiment. 
Figure HI shows results of applying our identity tracking method to the air traffic control dataset for various 
levels of observation noise and amount of missing identity observations. It is difficult to compare the per- 
formance to the method of [5j applied to the same dataset, since it is not clear how observation noise levels 
correspond to each other. However, error values reported in [5| were 0.12 to 0.17 on the 6 flights dataset and 
0.2 to 0.32 on the 10 flights dataset. This is comparable to what we get with our approach for observation 
error below 50%, even when 60% of the flight identities are unobserved. Results of the application of our 
state space model to this dataset indicate robustness of the model to the choice of the transition model, 
which was different from the generative model of our tracking inference engine. 



41! = 3.3453e+49 





(a) a frame from the tracking task 
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(b) tracking identities of 41 players 



Figure 5. Tracking error on a football visual surveillance dataset for 41 players as a func- 
tion of observation noise. Different missing data fractions are shown. 



Due to the unmanageable size of the factorial space in identity tracking problems, even the power- 
ful and efficient methods based on Fourier representation of permutations do not report results on more 
than 11 [4| or 15 [5[ simultaneously tracked objects. The results of Figure [3] show that our approach 
can handle large numbers of objects, and Figure H] demonstrated comparable accuracy on the air traf- 
fic control dataset. Next we show results on 41 objects from a visual surveillance dataset available from 



http : //vspets . visualsurveillajice . org/. Figure [S] shows an example of the underlying data and results 



of the identity tracking. The problem is similar to the above air traffic control: we have added uncertainty to 
the players, identities using the same exponential proximity model as before. Further, unlike the air traffic 
domain, here there are very few time steps that do not involve an identity swap. This kind of situation is 
difficult for recursive Bayesian filtering in general. However, our approach handles the situation and produces 
reasonable results with acceptable error rate - indeed, quite a good error rate, considering the size of the 
state space. 

5. Conclusions 

The main result of this work is embedding permutations into a continuous manifold, thus lifting a body 
of results from directional statistics field [8[ to the fields of ranking, identity tracking and others, where 
permutations play essential role. Among many potential applications of this embedding we have chosen 
probabilistic identity tracking and were able to set up a state-space model with efficient recursive Bayesian 
filter that produced results comparable with the state of the art techniques very efficiently even on a very large 
datasets that pose difficulties to existing methods. There remains much to be done in this direction. However, 
a simple model, that can be thought of as a continuous generalization of the Mallows model [2|, [Oj, equipped 
with results from the field of directional statistics has efficiently produced results of a reasonable accuracy. 
This is promising and encourages further development of more complicated probability distributions for 
permutations: further exploration of the exponential family already developed in the field [8| as well as 
developing more complex representations using spherical harmonics representations. 
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